De novo transcriptomes of cave and surface isopod crustaceans: insights from 11 species across three suborders

Isopods are a diverse group of crustaceans, that inhabit various environments, including terrestrial, freshwater, and marine, both on the surface and in the underground. The biological mechanisms underlying their wide range of adaptations to diverse ecological niches remain elusive. In order to unravel the molecular basis of their adaptability, we generated a comprehensive RNAseq dataset comprising 11 isopod species belonging to the three different suborders: freshwater Asellota, marine, brackish and freshwater Sphaeromatidea, and terrestrial Oniscidea, with representatives from families Asellidae, Sphaeromatidae, and Trichoniscidae, respectively. Representatives of each family were collected from both cave and surface environments, representing at least three independent cave colonization events. Three biological replicates were sequenced from each species to ensure data robustness. The 11 high-quality RNAseq datasets will serve as a valuable resource for understanding cave-specific adaptations, comparative and functional genomics, ecological annotation as well as aid in conservation efforts of these non-model organisms. Importantly, transcriptomes of eight featured species have been made publicly accessible for the first time.


Background & Summary
Isopods are a diverse group of terrestrial and aquatic crustaceans with over 10,000 described species, exhibiting a wide range of ecological adaptations 1 .They live in marine, brackish, and freshwater environments, and one lineage also conquered the land.Furthermore, they are one of the most abundant animal groups in caves, with multiple independent colonisations of both aquatic and terrestrial cave habitats throughout the world 2,3 .Understanding the molecular mechanisms that underlie their adaptations to multiple environments is essential for elucidating their ecological success.
Transcriptomics has emerged as an irreplaceable tool across biological disciplines.It is especially important for obtaining molecular sequence information of non-model organisms and provides a foundation for advancing our understanding of genetic diversity, population dynamics, structural variations, selective pressures, and adaptive traits in these species.However, the availability of genomic and transcriptomic resources for isopods is limited, with only a few species having their genomes sequenced or transcriptomes deposited in public databases.Currently, only a few genomes are available, from well-studied terrestrial species, such as Armadillidium vulgare (Latreille, 1804) 4 , Ligia exotica Roux, 1828 5 and Trachelipus rathkii (Brandt, 1833) 6 , but also a giant deep-sea isopod, Bathynomus jamesi Kou, Chen & Li, 2017 7 .Transcriptome data for isopods are more readily available, although isopods represent a diverse group of crustaceans, and the availability of genomic resources varies among different families, genera, and species 8 .For example, just a few species from the families Sphaeromatidae and Trichoniscidae have transcriptomes sequenced but none are cave dwelling 8,9 .Conversely, within Asellidae several RNAseq studies have been published, both on cave and surface representatives of the genus Proasellus 10,11 and the Asellus aquaticus (Linnaeus, 1758) [12][13][14][15] .Altogether, the genomic data for isopods is limited, especially in comparison to other crustaceans (amphipods and decapods) 16 or other arthropods like insects.
In this study, we focus on 11 isopod species from three different suborders and families: Proasellus coxalis s.l.(Dollfus, 1892), P. karamani Remy, 1934, P. anophtalmus dalmatinus (Karaman, 1955) and P. hercegovinensis (Karaman, 1933)   (Frankenberger, 1937) and Titanethes albus (C.Koch, 1841) from family Trichoniscidae (Oniscidea), collected from both cave and surface environments.Utilizing high-throughput RNA sequencing, we generated high-quality de novo transcriptomes for each species, with eight of them being sequenced for the first time.This dataset has numerous applications as it enables transcriptomic studies of various isopod species, to explore specific aspects of their biology, adaptation, and ecology.It provides a molecular framework for understanding how isopods from distinct families respond to environmental challenges, as it enables identification of genes and pathways which are involved in cave-specific adaptations.Comparative analyses can reveal both conserved and lineage-specific genes, thus shedding light on the evolutionary history of these crustaceans.Additionally, this dataset can be utilized for ecological annotation of unknown transcripts.The pace at which Next-Generation Sequencing (NGS) datasets are generated is in stark contrast to our current understanding of the functions of the genes they uncover, which remains a significant challenge in genomics and molecular biology.Ecological annotation is an increasingly relevant approach, bridging the gap between gene sequences and their ecological roles.Finally, considering that isopods play a crucial role in many ecosystems, contributing to nutrient cycling and decomposition, fluctuations in their populations can reflect broader ecosystem health.Isopod transcriptomic studies can aid in monitoring and assessing the impacts of environmental disturbances on these ecosystems, and ultimately guide conservation actions.

Methods
Work-flow.The overall process, starting with sampling live individuals from nature to de novo transcriptome assemblies and all the downstream analysis using various bioinformatics tools and validation steps is summarised in Fig. 1.

Sample collection.
Species were selected to obtain independent replicates of subterranean colonization events across three different isopod families.Specimens were collected from cave and surface environments across various geographical locations (Table 1, Fig. 2), using tweezers, brushes, transfer pipettes, large pipettes (turkey basters), nets, and aspirators.Sampling sites for Asellidae were chosen based on a comprehensive search in WAD (World Asellidae Database) [17][18][19] database.The selection of localities for Sphaeromatidae and Trichoniscidae was based on an internal database of the Croatian Biospeleological Society.The species were identified according to morphological criteria based on descriptions/redescriptions and identification keys [20][21][22][23][24][25][26][27] .Taxonomic arrangement and nomenclature follow Boyko et al. 1 .
Sampled individuals were placed in containers with native water, or, in case of terrestrial species, in a plastic container with a layer of plaster 28 and transported to the laboratory.Cave species were maintained in complete darkness, whereas surface species were housed in 12:12 hours light:dark cycle.After several weeks or months in laboratory, specimens were randomly selected, and starved for several days.Individuals were flash frozen in liquid nitrogen and stored at −80 °C until further processing.For details on the biology, distribution and ecology of selected species see Lukić et al. 28 .
RNa extraction and sequencing.Total RNA was extracted from 9 whole individuals per species, in total 99 specimens.SPLIT RNA Extraction Kit (Lexogen, cat # SKU: 008.48) was used for Asellidae and Trichoniscidae.Sphaeromatidae were first homogenized in DNA/RNA Shield (Zymo Research, cat # R1100) and RNA was extracted with Quick-DNA/RNA mini prep (Zymo Research, cat.# D7001).The quality of the extracted RNA was rigorously assessed through a combination of several methods: agarose gel electrophoresis and Agilent 2100 Bioanalyzer with RNA 6000 Nano Kit (Agilent Technologies, cat # 5067-1511) for RNA integrity, and spectrophotometry with SPECTROstar Nano Microplate Reader (BMG LABTECH, Germany) for purity.Subsequently, RNA samples underwent TURBO DNAse treatment (Thermo Fisher Scientific, cat #AM2239) until there was no detectable amplification of the marker gene.PCR amplification was performed with GoTaq ® G2 Green Master Mix (Promega, cat # M7822,) and the 16 S primers (for Asellidae and Sphaeromatidae 16Sbr-L 5′-CGCCTGTTTATCAAAAACAT-3′ and Stena_R1 5′ -CGTGGAAGTTTAATAGTCGAACAGAC-3′; for Trichoniscidae 16 SarL 5′-CGCCTGTTTATCAAAAACAT-3′ and 16H2 5′-AGATAGAAACCAACCTGG-3′). The thermal cycling protocol consisted of: an initial denaturation at 95 °C for 2 min, followed by 35 cycles of denaturation at 95 °C for 45 seconds, annealing at 52 °C for 45 seconds, and extension at 72 °C for 45 seconds, concluding with a 5-minute final extension at 72 °C.RNA concentrations were measured using a fluorometric approach (QFX Fluorometer and DeNovix RNA Quantification Kit, cat.# KIT-RNA-2-NS, DeNovix).Three individual samples were pooled equimolarly resulting in a total of three biological replicates per each species.Directional or stranded libraries were generated using NEB library prep kits utilizing poly-A selection approach followed by PE150 (paired-end 150 base pairs) sequencing on the NovaSeq.6000 platform (Illumina) conducted by Novogen Europe, Cambridge, UK.
Raw reads processing and de novo transcriptome assembly.Raw sequencing reads were quality-checked with FastQC 29,30 and MultiQC 30 and trimmed for low-quality bases using Trimmomatic 31 .Cleaned reads were rechecked for quality prior to further analysis.If not stated otherwise, all the downstream analyses were conducted on the computer cluster Isabella (University of Zagreb).Species-specific de novo transcriptomes were generated from each dataset using Trinity 32 software.transcriptome quality assessment and mapping.Transcriptomes were analyzed for basic statistics with Transrate 33 and FastaStatistics 34 .We assessed the completeness of each de novo assembly using the Benchmarking Universal Single-Copy Orthologs (BUSCO) software 35 .It involved searching for the presence of eukaryote 'core' genes in each assembly, with the Arthropoda database serving as the reference (dataset: arthrop-oda_odb10 (2020-09-10, 90 genomes, 1,013 BUSCOs).
Self-mapping of reads of each biological replicate against the respective de novo assembly, as another measure of quality evaluation, was conducted using Salmon software 36 .
Expression matrices were computed using the Perl script abundance_estimates_to_matrix.plcontained in the Trinity 37 package for each set of biological replicates separately using Salmon quantification files as inputs.Pairwise Pearson's correlation coefficients between biological replicates within each species were calculated on TPM matrices in R studio 38 using cor: Correlation, Variance and Covariance (Matrices) function and visualized with packages reshape2 39 and ggplot2 40 .
protein prediction, functional annotation and orthogroup inference.TransDecoder v5.7.0 (https://github.com/TransDecoder/TransDecoder)was employed to identify open reading frames (ORFs) and predict candidate coding regions, discarding possible non-coding RNA and DNA contamination.Translated transcripts with all types of coding regions (terminal, internal and complete coding sequences) were functionally annotated with ultra-sensitive mode in Diamond 41 at European Galaxy Serve using BLASTX and BLASTP methods against UniProtKB/Swiss-Prot databases (last update in March 2023).Homology with PFAM common protein domains 42 was evaluated using profile hidden Markov models or HMMER tool 43 .Transmembrane domains in protein sequences were predicted with TMHMM tool [44][45][46] on European Galaxy Server (ExpAA > 80).To elucidate the functional roles of identified transcripts, Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were conducted with EggNOG mapper 47,48 using web interface (http:// eggnog-mapper.embl.de/).Finally, OrthoFinder 49,50 was employed to analyze protein sequences for phylogenetic orthology inference among 11 isopod species.Species tree has been inferred by using the STAG 51 , rooted using the STRIDE algorithm 52 and visualized using FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/).

technical Validation
RNa and libraries quality control.Only RNA samples with confirmed high quality and integrity (concentration ≥20 ng/μL and flat base line on Bioanalyzer) were used for pooling.It is important to note that RIN number as a measure of RNA integrity can't be used with isopods since this group, as arthropods in general, shows different numbers of peaks due to presence of hidden breaks in rRNA 66 .Prior to multiplexing, libraries were checked for fragment distribution and concentration to ensure all sequencing criteria have been met (done by Novogene Sequencing company).Read and de novo assembly basic statistics.Sequencing yields, assembly and annotation statistics, completeness and mapping results are shown in Table 2. Briefly, a total of ~180 gigabases (Gb) of raw data were obtained, which is approximately 5.45 Gb per sample on average.The raw paired-end reads ranged between 19,6 to 27,7 million and consisted of a high-quality Q30 score (base error <0.1%) (median Q30 = 93%).FASTQC results indicated that cleaned reads passed minimum quality standards.The number of retained reads after filtering and adapter removal exceeded 96% which ensured high level of mapping with an average of 90.81% (88,8% -93,1%) reads mapped to their respective species-specific transcriptome.Since 80% read  mapping is an indication of a reliable assembly according to the standards set by the Trinity protocol, these results demonstrate that our de novo assemblies are of high-quality (Fig. 3).Intraspecies Pearson correlation coefficient between biological replicate pairs was calculated on TPM normalized count matrices (Fig. 3) and it shows good reproducibility and reliability of the experimental design for at least 7 species (Pearson coefficient is 0,95 or higher).Three species (Proasellus coxalis s.l., P. anophtalmus and A. balthasari) have one sample each which shows lower correlation with the rest of the samples (less than 0,80).
The assemblies consisted of 127.009 to 281.607 transcripts (median 152.531).Mean contig length is ranging from 927 to 1745 and contig L50 from 19,539 to 42.128.All transcriptomes had N50 values that exceeded 1,000 bp.The smallest transcript was 272 bp long (in P. hercegovinensis).The longest transcript across all eleven assemblies was between 22.374 and 35.440 bp long.
Overall results indicate that sequencing has yielded in high-quality RNAseq datasets for 11 isopod species.
protein prediction and annotation.At least 51% of the transcripts had ORFs with maximum of 78,8% of all transcripts with coding regions in Hyloniscus beckeri transcriptome.Predicted proteins had on average minimum lenght of 85 aa (amino acids) and mean lenght of 362 aa (with maximum of 11.334 aa in A. balthasari).Out of these predicted proteins, on average 46% ORFs were complete and include START and STOP codon.Annotation analysis revealed that the percentage of coding sequences with significant BlastP hits was similar across all eleven assemblies, with at least 50% in UniProt/SwissProt database, 36% in GO and 32% of sequences being annotated in KEGG.Moreover, at least 60% had a recognizable protein domain and on average 4% of coding sequences had predicted transmembrane helices which makes them potential integral membrane proteins (Table 2).
Orthology prediction and phylogenetic relationship confirmation.Orthofinder 49,50 was utilized to detect potential orthologs and group proteins into orthogroups for the eleven species (Table 3).Only the longest predicted protein by Transdecoder was used in the analysis.A total of 848.798 predicted proteins (89,9%) were assigned to 85.288 orthogroups.8,673 orthogroups (10.2%) were found to be shared among all 11 species.100,771 predicted proteins were identified as species-specific, and were categorized into 24,687 inferred orthogroups.Principal component analysis (PCA) and hierarchical clustering of the TPM expression levels of 15 single-copy orthologues among all isopod species revealed that samples primarily clustered according to the species with first two components explaining more than 72% of the variance (Fig. 4).
Orthofinder also generated a species tree based on 8.673 orthogroups with orthologues present in all species 51,52 .In the ortholog phylogram (Fig. 5), a distinct separation among three suborders (Asellota, Oniscidea and Sphaeromatidae) is evident, aligning with expectations 67 .Species from the Monolistra and Proasellus genera constitute a monophyletic group.Cave representatives are clearly distinct in all three suborders.There are no molecular family-level phylogenies of Trichoniscidae and Sphaeromatidae, but recent molecular phylogeny studies support our phylogenetic reconstructions for the family Asellidae 19 .It is important to note that this phylogram was not constructed based on an evolutionary model but solely relies on the substitution rates of single-copy orthologs.

Fig. 1
Fig.1Outline of the experimental workflow and the analysis pipeline.

Fig. 3
Fig. 3 Transcriptome quality assessment.(a) Heatmaps of Pearson correlation coefficients between the three replicates for every species.(b) Benchmarking Universal Single-Copy Orthologs (BUSCO) scores of the 11 de novo transcriptomes.(c) Boxplots showing read mapping ratios of RNA-seq data to the de novo transcriptomes (three replicates for each species were mapped and the dashed line indicates the average read mapping ratio of the 33 RNA-seq samples).

Fig. 4
Fig. 4 Principal component analysis (PCA) of the expression levels of 15 single-copy orthologues among 11 isopod species.The proportion of variance explained by each principal component is provided in parentheses along each axis.

Fig. 5
Fig.5 Orthology analysis across the sampled isopod species.(a) The consensus species tree based on gene trees of 8.673 orthologues present in all species (the support value for each bipartition is the number of individual species trees that contained that bipartition; black and white circle represents surface and cave species, respectively).(b) The number of coding sequences which are unassigned to any orthogroup (orange), speciesspecific (yellow) or shared in at least another species (green).